
    Spectral Modelling Synthesis of Vehicle Pass-by Noise

    Spectral Modelling Synthesis (SMS) is a sound synthesis technique that models the time-varying spectra of given sounds as a collection of sinusoids plus a filtered noise component. Although originally used to produce musical sounds, this technique can also be extended to the analysis, transformation and synthesis of a wide range of environmental sounds, such as traffic noise. Simplifications based on psychoacoustic analysis can be applied during the modelling process to discard redundant data while maintaining perceptual similarity between synthesized sounds and the original recordings of vehicle pass-by noise. In this paper, we investigate whether this perceptual similarity can be described by objective metrics, and how to improve the synthesis by tuning the parameters of the SMS algorithm. The results show that vehicle pass-by sounds characterized by tyre and engine noise can be well synthesized with different parameter sets in the SMS algorithm. Furthermore, Zwicker Roughness is found to be a sensitive metric for measuring the perceptual similarity between original recordings and synthesized sounds, as it varies significantly when tuning SMS parameters.
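    As a concrete illustration of the sinusoids-plus-noise decomposition that SMS performs, the following Python sketch picks per-frame spectral peaks as sinusoidal partials; full SMS would additionally track partials across frames and model the residual as filtered noise. The function and parameter names (sms_analyse, n_peaks) are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.signal import stft, find_peaks

def sms_analyse(x, fs, n_fft=2048, hop=512, n_peaks=20):
    """Per-frame sinusoidal peak picking, the first stage of an SMS analysis."""
    f, t, X = stft(x, fs, nperseg=n_fft, noverlap=n_fft - hop)
    mag = np.abs(X)
    partials = []
    for frame in mag.T:
        idx, _ = find_peaks(frame)                    # candidate sinusoids
        idx = idx[np.argsort(frame[idx])[-n_peaks:]]  # keep the strongest
        partials.append(list(zip(f[idx], frame[idx])))
    # In full SMS the picked sinusoids are resynthesised and subtracted from
    # the input; the residual is then modelled as noise shaped by a
    # time-varying spectral envelope.
    return partials, t
```

    Tuning values such as n_fft, hop and n_peaks corresponds to the kind of SMS parameter tuning whose perceptual effect the paper measures.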

    An evaluation of pre-processing techniques for virtual loudspeaker binaural ambisonic rendering

    Binaural Ambisonic rendering is widely used in immersive applications such as virtual reality due to its sound field rotation capabilities. Binaural Ambisonic reproduction can theoretically replicate the original sound field exactly for frequencies up to what is commonly referred to as the 'spatial aliasing frequency', f_alias. At frequencies above f_alias, however, reproduction can become inaccurate due to the limited spatial accuracy of reproducing a physical sound field with a finite number of transducers, which in practice causes localisation blur, reduced lateralisation and comb filtering spectral artefacts. The standard approach to improving Ambisonic reproduction is to increase the order of Ambisonics, which allows for exact sound field reproduction up to a higher f_alias, though at the expense of more channels for storage, more microphone capsules for recording, and more convolutions in binaural reproduction. It is therefore highly desirable to explore alternative methods of improving low-order Ambisonic rendering. One common practice is to employ a dual-band decoder with basic Ambisonic decoding at low frequencies and Max r_E channel weighting above f_alias, which improves spectral, localisation and lateralisation reproduction. Virtual loudspeaker binaural Ambisonic decoders can be made by multiplying each loudspeaker's head-related impulse responses (HRIRs) by the decode matrix coefficients and summing the resulting spherical harmonic (SH) channels. This approach allows for dual-band decoding and loudspeaker configurations with more loudspeakers than SH channels whilst minimising the required number of convolutions. Binaural Ambisonic reproduction using the virtual loudspeaker approach is then achieved by summing the direct convolution of each SH channel of the encoded signal with the corresponding SH channel of the binaural decoder. This paper presents the method and results of a perceptual comparison of state-of-the-art pre-processing techniques for virtual loudspeaker binaural Ambisonic rendering. By implementing these pre-processing techniques in the HRTFs used in the virtual loudspeaker binaural rendering stage, improvements can be made to the rendering. All pre-processing techniques are implemented offline, such that the resulting binaural decoders are of the same size and require the same number of real-time convolutions. The three pre-processing techniques investigated in this study are:
    - Diffuse-field Equalisation (DFE)
    - Ambisonic Interaural Level Difference Optimisation (AIO)
    - Time Alignment (TA)
    DFE is the removal of direction-independent spectral artefacts in the Ambisonic diffuse field. AIO augments the gains of the left and right virtual loudspeaker HRTF signals above f_alias such that Ambisonic renders produce more accurate interaural level differences (ILDs). TA is the removal of interaural time differences (ITDs) between the HRTFs above f_alias to reduce high frequency comb filtering effects. The test follows the multiple stimulus with hidden reference and anchors (MUSHRA) paradigm, ITU-R BS.1534-3. Tests are conducted in a quiet listening room using a single set of Sennheiser HD 650 circum-aural headphones and an Apple MacBook Pro with a Fireface UCX audio interface, which has software-controlled input and output levels. Headphones are equalised from the RMS average of 11 impulse response measurements, with 1 octave band smoothing in the inverse filter. All audio is 24-bit depth and 48 kHz sample rate. Listening tests are conducted using first, third and fifth order Ambisonics, with respective loudspeaker configurations comprising 6, 26 and 50 loudspeakers, arranged in Lebedev grids. The different test conditions are made up of various combinations of the three pre-processing techniques. The test conditions are as follows:
    1. HRTF convolution (reference)
    2. Standard Ambisonic (dual band)
    3. Ambisonic with DFE (dual band)
    4. Ambisonic with AIO (dual band)
    5. Ambisonic with AIO & DFE (dual band)
    6. Ambisonic with TA & DFE (basic)
    7. Ambisonic with TA & AIO & DFE (basic)
    8. Ambisonic with TA & AIO & DFE (dual band)
    The stimuli are synthesised complex acoustic scenes, defined in this paper as acoustic scenes with multiple sources. The synthesised complex scene used in this paper is composed from 24 freely available stems of an orchestra. Instruments are isolated and empirically matched in loudness. The orchestral stems are panned to the vertices of a 24-point T-design arrangement, to ensure minimal overlap between virtual loudspeaker positions in the binaural decoders and the sound sources in the complex scene. Synthesising complex scenes in this way allows for an explicit target reference stimulus, in this case a direct HRTF-convolved render. If the Ambisonic stimuli are perfectly reconstructed, they will be equivalent to the reference stimulus. Results are analysed using non-parametric statistics and discussed in the full manuscript. The conclusion suggests the perceptually preferred pre-processing algorithms for virtual loudspeaker binaural Ambisonic rendering.
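    The decoder construction described above reduces to a matrix product followed by one convolution per SH channel and ear. A minimal numpy sketch, assuming hrirs of shape (n_loudspeakers, 2, taps) and a decode matrix of shape (n_loudspeakers, n_sh); the array layout and function names are illustrative:

```python
import numpy as np
from scipy.signal import fftconvolve

def build_binaural_decoder(hrirs, decode_matrix):
    """Fold the decode matrix into the HRIRs: one stereo IR per SH channel.

    hrirs: (n_loudspeakers, 2, taps); decode_matrix: (n_loudspeakers, n_sh).
    Returns (n_sh, 2, taps): the virtual loudspeaker binaural decoder.
    """
    # decoder[k] = sum over loudspeakers l of D[l, k] * hrir[l]
    return np.einsum('lk,let->ket', decode_matrix, hrirs)

def render_binaural(sh_signal, decoder):
    """Sum the convolution of each SH channel with its decoder IR, per ear."""
    n_sh = decoder.shape[0]
    ears = [sum(fftconvolve(sh_signal[k], decoder[k, ear])
                for k in range(n_sh))
            for ear in range(2)]
    return np.stack(ears)          # (2, samples + taps - 1) binaural output
```

    A dual-band decoder as described in the abstract would combine two such decoders (basic weighting below f_alias, Max r_E above) via crossover filtering; that step is omitted here for brevity.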

    Application of Machine Learning for the Spatial Analysis of Binaural Room Impulse Responses

    Spatial impulse response analysis techniques are commonly used in the field of acoustics, as they help to characterise the interaction of sound with an enclosed environment. This paper presents a novel approach to the spatial analysis of binaural impulse responses, using a binaural-model-fronted neural network. The proposed method uses binaural cues utilised by the human auditory system, which are mapped by the neural network to azimuth direction-of-arrival classes. A cascade-correlation neural network was trained using a multi-conditional training dataset of head-related impulse responses with added noise. The neural network was tested using a set of binaural impulse responses captured using two dummy head microphones in an anechoic chamber, with a reflective boundary positioned to produce a reflection with a known direction of arrival. Results showed that the neural network generalised to the direct sound of the binaural room impulse responses for both dummy head microphones. However, it was found to be less accurate at predicting the direction of arrival of the reflections. The work indicates the potential of such an algorithm for the spatial analysis of binaural impulse responses, while highlighting where the method needs to be made more robust for more general application.
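    For orientation, here is a heavily simplified sketch of the kind of interaural cues a binaural model front end can supply to such a classifier. The paper's actual model, including any auditory filterbank, is not reproduced here, so treat the broadband ITD and ILD below as illustrative only:

```python
import numpy as np

def interaural_cues(left, right, fs, max_itd=1e-3):
    """Broadband ITD (cross-correlation peak) and ILD (RMS level ratio)."""
    max_lag = int(max_itd * fs)
    xcorr = np.correlate(left, right, mode='full')
    centre = len(right) - 1                        # zero-lag index
    window = xcorr[centre - max_lag: centre + max_lag + 1]
    # Negative ITD means the left channel leads, under np.correlate's
    # lag convention.
    itd = (np.argmax(window) - max_lag) / fs       # seconds
    rms = lambda s: np.sqrt(np.mean(np.square(s)))
    ild = 20.0 * np.log10(rms(left) / rms(right))  # dB, positive = louder left
    return itd, ild
```

    Cues like these, computed per time frame (and typically per frequency band), form the input features that the network maps to azimuth classes.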

    3D Reflector Localisation and Room Geometry Estimation using a Spherical Microphone Array

    The analysis of room impulse responses to localise reflecting surfaces and estimate room geometry is applicable in numerous aspects of acoustics, including source localisation, acoustic simulation, spatial audio, audio forensics, and room acoustic treatment. Geometry inference is an acoustic analysis problem in which information about reflections, extracted from impulse responses, is used to localise reflective boundaries present in an environment and thus estimate the geometry of the room. This problem becomes more complex, however, when considering non-convex rooms, as the room shape cannot be constrained to a subset of possible convex polygons. This paper presents a geometry inference method for localising reflective boundaries and inferring the room's geometry for convex and non-convex room shapes. The method is tested using simulated room impulse responses for seven scenarios, and real-world room impulse responses measured in a cuboid-shaped room, using a spherical microphone array containing multiple spatially distributed channels capable of capturing both time- and direction-of-arrival. Results show that the general shape of the rooms is inferred in each case, with a higher degree of accuracy for convex room shapes. However, inaccuracies generally arise as a result of the complexity of the room being inferred, or inaccurate estimation of the time- and direction-of-arrival of reflections.
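    One common geometry-inference step turns each detected reflection into a boundary estimate: the reflection's time- and direction-of-arrival place an image source, and the reflecting plane is the perpendicular bisector between the true source and that image source. A minimal sketch under those assumptions (first-order reflections, known source position; not necessarily the paper's exact pipeline):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def boundary_from_reflection(array_pos, source_pos, toa, doa_unit):
    """Estimate a reflecting plane (unit normal n, offset d, with n . x = d).

    toa: propagation time of the reflection path in seconds;
    doa_unit: unit direction-of-arrival vector at the array.
    """
    array_pos, source_pos = np.asarray(array_pos), np.asarray(source_pos)
    # The image source lies along the DOA at the reflection path distance.
    image = array_pos + SPEED_OF_SOUND * toa * np.asarray(doa_unit)
    n = image - source_pos
    n = n / np.linalg.norm(n)               # plane normal
    midpoint = 0.5 * (image + source_pos)   # plane passes through the midpoint
    return n, float(n @ midpoint)
```

    Intersecting the estimated planes then yields the inferred room shape; errors in time- and direction-of-arrival estimation propagate directly into the boundary estimates, consistent with the inaccuracies reported above.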

    Towards a perceptually optimal bias factor for directional bias equalisation of binaural ambisonic rendering

    Ambisonics has enjoyed a recent resurgence in popularity due to virtual reality applications, where Ambisonic audio is presented to the user binaurally in conjunction with a head-mounted display. In this scenario, however, it is imperative to maximise the coherence between audio in the frontal direction and visuals in order to maintain immersion. Ambisonic reproduction can theoretically be perfect in the centre of the loudspeaker array for frequencies up to the 'spatial aliasing frequency', f_alias. At frequencies above f_alias, however, the limited spatial accuracy of reproducing a physical sound field with a finite number of transducers causes artefacts such as localisation blur, reduced lateralisation and comb filtering. One approach for improving spectral reproduction of binaural Ambisonic rendering is diffuse-field equalisation. In a previous study, the authors applied this technique to virtual loudspeaker binaural Ambisonic decoders, which improved both the spectral response over the sphere and predicted median plane elevation localisation. However, there still exists a perceivable difference in timbre between diffuse-field equalised binaural Ambisonic rendering and HRTF convolution. By altering the diffuse-field equalisation method to concentrate the equalisation on one specific direction (for the virtual reality application, this direction is the front), it is possible to create a hybrid of free-field and diffuse-field equalisation such that frontal reproduction becomes more accurate, at the expense of other directions. This is referred to as directional bias equalisation (DBE) and is a two-stage equalisation process: the first stage obtains a frontally biased spherical field response and equalises it, and the second re-equalises this response against the ideal corresponding frontal response, such that an infinite directional bias would produce frontal Ambisonic audio equivalent to frontal HRTF convolution. DBE is a pre-processing stage that can be implemented offline. Increasing the frontal bias factor (represented in this paper by κ) improves spectral reproduction of frontal sounds to a greater extent, though at the expense of spectral accuracy in other directions. This paper presents the results of a perceptual listening test that attempts to determine the optimal bias factor for an appropriate trade-off between improved frontal fidelity and reduced lateral fidelity. The test follows the multiple stimulus with hidden reference and anchors (MUSHRA) paradigm, ITU-R BS.1534-3. Tests are conducted in a quiet listening room using a single set of Sennheiser HD 650 circum-aural headphones and an Apple MacBook Pro with a Fireface UCX audio interface, which has software-controlled input and output levels. Headphones are equalised from the RMS average of 11 impulse response measurements, with 1 octave band smoothing in the inverse filter. All audio is 24-bit depth and 48 kHz sample rate. Listening tests are conducted using first, third and fifth order Ambisonics, with loudspeaker configurations comprising 6, 26 and 50 loudspeakers respectively, arranged in Lebedev grids. Test conditions are as follows:
    - HRTF convolution (reference)
    - Standard Ambisonic
    - DBE Ambisonic with κ = 1
    - DBE Ambisonic with κ = 3
    - DBE Ambisonic with κ = 5
    - DBE Ambisonic with κ = 9
    - DBE Ambisonic with κ = 17
    - DBE Ambisonic with κ = 33
    Additional stimuli are a low and a mid anchor for the simple scenes, comprising the HRTF reference low-passed at 3.5 kHz and 7 kHz respectively, and a zeroth order Ambisonic render for the complex scene. Two simple scenes and one complex scene are used. Each simple scene comprises a single pink noise source: the first is panned directly in front of the listener at (θ, φ) = (0°, 0°), and the second directly to the left of the listener at (θ, φ) = (90°, 0°). The complex scene is simulated by mixing a frontal pink noise with a diffuse soundscape. The pink noise consists of a 0.5 s burst followed by 0.5 s of silence. The diffuse soundscape is synthesised from 24 decorrelated monophonic sound scene recordings of a train station. The 24 sources are panned to the vertices of a 24-point T-design quadrature, to ensure minimal overlap between virtual loudspeaker positions in the binaural decoders and the sound sources in the complex scene. The frontal noise is 3 dB RMS louder than the diffuse soundscape. Results are analysed using non-parametric statistics and discussed in the full manuscript. The conclusion suggests the perceptually optimal bias factor for improved frontal spectral fidelity with minimal perceived lateral degradation for virtual loudspeaker binaural Ambisonic rendering.
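    The directionally biased spherical field response in the first DBE stage amounts to a weighted RMS average of rendered magnitude responses over the sphere. The abstract does not give the weighting function, so the weighting in the following sketch, w proportional to ((1 + cos Θ)/2)^κ with Θ the angle from the frontal direction, is purely an assumed stand-in to show the structure:

```python
import numpy as np

def biased_field_response(mags, dirs, quad_weights, kappa):
    """Weighted RMS magnitude over the sphere, biased towards the front.

    mags: (n_dirs, n_bins) rendered magnitude responses;
    dirs: (n_dirs, 3) unit direction vectors; quad_weights: (n_dirs,).
    The bias weighting below is an assumption, not the paper's function.
    """
    front = np.array([1.0, 0.0, 0.0])
    cos_angle = dirs @ front
    w = quad_weights * ((1.0 + cos_angle) / 2.0) ** kappa
    w = w / w.sum()
    return np.sqrt(w @ mags**2)    # (n_bins,) biased spherical field response
```

    The inverse of this response gives the first-stage equalisation filter; with this stand-in weighting, κ = 0 recovers an unbiased diffuse-field-style average, and larger κ concentrates the average towards the front.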

    Diphthong Synthesis Using the Dynamic 3D Digital Waveguide Mesh

    Articulatory speech synthesis has the potential to offer more natural sounding synthetic speech than established concatenative or parametric synthesis methods. Time-domain acoustic models are particularly suited to the dynamic nature of the speech signal, and recent work has demonstrated the potential of dynamic vocal tract models that accurately reproduce the vocal tract geometry. This paper presents a dynamic 3D digital waveguide mesh (DWM) vocal tract model, capable of movement to produce diphthongs. The technique is compared to existing dynamic 2D and static 3D DWM models, for both monophthongs and diphthongs. The results indicate that the proposed model provides improved formant accuracy over existing DWM vocal tract models. Furthermore, the computational requirements of the proposed method are significantly lower than those of comparable dynamic simulation techniques. This paper represents another step toward a fully functional articulatory vocal tract model, which will lead to more natural speech synthesis systems for use across society.
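    For context, the rectilinear DWM family this work builds on has a compact finite-difference form: each junction's new pressure is twice the mean of its neighbours' previous pressures minus its own pressure two steps earlier. A minimal static 3D sketch follows; the paper's contribution, a dynamic mesh shaped by the vocal tract geometry, is not reproduced here:

```python
import numpy as np

def dwm_step(p_prev, p_prev2):
    """One update of a 3D rectilinear digital waveguide mesh.

    p_next = (2/6) * sum(neighbour pressures at step n-1) - pressure at n-2,
    with crude pressure-release (p = 0) boundaries on the domain faces.
    """
    neighbours = (np.roll(p_prev, 1, axis=0) + np.roll(p_prev, -1, axis=0) +
                  np.roll(p_prev, 1, axis=1) + np.roll(p_prev, -1, axis=1) +
                  np.roll(p_prev, 1, axis=2) + np.roll(p_prev, -1, axis=2))
    p_next = (2.0 / 6.0) * neighbours - p_prev2
    for axis in range(3):                  # zero the six boundary faces
        idx = [slice(None)] * 3
        for end in (0, -1):
            idx[axis] = end
            p_next[tuple(idx)] = 0.0
    return p_next
```

    A dynamic vocal tract model additionally moves the mesh boundaries (the tract walls) between updates, which is where the capability to produce diphthongs comes from.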

    Impulse Response Estimation for the Auralisation of Vehicle Engine Sounds Using Dual Channel FFT Analysis

    A method is presented to estimate the impulse response of a filter that describes the transformation in sound that takes place between a close-mic recording of a vehicle engine and the sound of the same engine at another point in or near to the vehicle. The proposed method makes use of the Dual Channel FFT Analysis technique and does not require the use of loudspeakers, computer modelling or mechanical devices. Instead, a minimum of two microphones is required and the engine itself is used as the source of sound. This is potentially useful for virtual reality applications or in sound design for computer games, where users select their virtual position at points inside or outside the vehicle. A case study is described to examine the method in practice and the results are discussed. The described method can be readily extended to surround sound applications using spatial microphone array recording techniques.
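    The core of the Dual Channel FFT technique is the standard H1 transfer function estimator: the averaged cross-spectrum between the close-mic signal x and the distant-mic signal y, divided by the averaged auto-spectrum of x, with the impulse response recovered by inverse FFT. A minimal sketch using scipy; the segment length and function names are illustrative choices, not the paper's exact processing chain:

```python
import numpy as np
from scipy.signal import csd, welch

def estimate_ir(x, y, fs, n_fft=8192):
    """H1 impulse response estimate between input x and output y.

    Welch-style averaging across segments suppresses noise that is
    uncorrelated with the engine source signal.
    """
    _, Sxy = csd(x, y, fs=fs, nperseg=n_fft)   # cross power spectral density
    _, Sxx = welch(x, fs=fs, nperseg=n_fft)    # input auto spectral density
    H = Sxy / Sxx                              # H1 = Sxy / Sxx
    return np.fft.irfft(H)                     # time-domain impulse response
```

    Convolving a new close-mic engine recording with the estimated impulse response then auralises the engine at the second microphone's position.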

    Establishment and Implementation of Guidelines for Narrative Audio-based Room-scale Virtual Reality using Practice-based Methods

    Room-scale Virtual Reality (VR) presents sound designers with new challenges in telling stories with audio in games with player-driven narratives. These challenges arise from the player moving within, and interacting with, the virtual environment. The paper performs a small scoping review of VR and non-VR games and associated literature, using practice-based research methods, to identify issues with, and solutions for, the placement of speech-based audio. The review leads to the proposition of design guidelines and strategies for their implementation. The paper advocates that each instance of speech-based audio should be short, interactive, and complemented by non-speech audio. Furthermore, each instance's spatial, interactive, visual, aural, and narrative representation should be considered in combination. The paper also suggests that 3D binaural audio informed by physics can aid storytelling and make virtual environments player-responsive.